Introduction¶

In this project, I am going to be analyzing Baltimore Crime data, attempting to infer major factors for crime, including where they tend to occur with respect to geographic features.

Why is this worth looking into?¶

According to neighborhoodscout.com, Baltimore has a "Total Crime Index" of 3, meaning that is only safer than roughly 3% of neighborhoods across the United States. This is an absurd statistic; why is the most populous city in Maryland (and by a long shot too) such a dangerous place to live? In order to cut down on crime, we must first understand why it is occurring so that we might address the underlying issues at their source. Obviously, "why does crime occur" is an immensely complex question which would require millions of dollars of research in order to answer concretely. However, what we can do is draw correlations between crime and different suggesting factors, breaking down the overall problem into pieces that are more directly addressable. Armed with this knowledge, the police can more intelligently allocate their resources to areas with higher crime rates, and may be able to predict how city-level changes will affect crime in the future. For example, if we found that proximity to car dealerships had a significant effect on crime rates in the surrounding area, we could pre-emptively increase police presence in an area with an upcoming dealership to deter crime before it occurs. Overall, by analyzing existing crime data, we may be able to draw conclusions that will reduce crime in the future.

Explored Factors¶

Before we get into our code, we will determine a list of factors to look into The source I pulled the factors from will be linked above each set of factors.

  • NCSBI - Crime Factors

    1. population density
    2. median income
  • Crime deterrent effect of police stations by Fondevila et al. 3. proximity to police stations

  • The Association between Density of Alcohol Establishments and Violent Crime within Urban Neighborhoods by Toomey et al. 4. proximity to bars

  • Baltimore Liquor Stores Linked More to Violent Crime Than Bars and Restaurants, published by the John Hopkins Bloomberg School of Public Health 5. proximity to liquor stores

  • Understanding the Criminogenic Properties of Vacant Housing: A Mixed Methods Approach by Porter et al. (this is a UMD paper!) 6. proximity to vacant buildings

  • Exploring the relationship between strip clubs and rates of sexual violence and violent crime | Request PDF 7. proximity to strip clubs

Rejected Factors¶

  1. proximity to parks
    • according to the authors of Can parks help cities fight crime?, certain parks increase crime severely, and certain crimes decrease it just as much
    • therefore, this would not be a good factor to analyze
  2. presence of gangs

General Setup¶

Let's import the core libraries that we will be using in our project.

In [1]:
import requests

import pandas as pd
import plotly.express as px
import geopandas as gpd
from shapely.geometry import Point

from bs4 import BeautifulSoup
import re
from time import sleep

Data Collection¶

Setup¶

We want all of our data to lie in the data/ subdirectory.

First, let's make a SPEC matching each dataset/topic to file(s) that they will be saved in.

In [14]:
data_files = [
    'crime-data',
    'crime-codes',

    'police-stations',
    'population-density',
    'median-income',
    'liquor-licenses',

    'vacant-buildings',
    'baltimore-csa'
]

We only want to download the data once, so that future re-runs do not result in downloading it again. To accomplish this, we will create a version control system for our data. The benefit of this is that when we change our computation logic, we will not have to redownload any data. First, let's put data corresponding to each tag in its own subdirectory.

In [15]:
def data_directory(tag) -> str:
    directory = f'data/{tag}'

    try:
        os.mkdir(directory)
    except:
        pass

    return directory

Next, this function will, given the "tag" of the topic and a version, yield a filepath for the data. For example, data_filepath('crime-data', 0) would yield data/crime-data/v0.csv.

In [2]:
def data_filepath(tag, version=0, extension=None) -> str:
    if tag not in data_files:
        raise Exception(f'Invalid tag: {tag}')

    directory = data_directory(tag)

    if extension is None:
        # then, autodetect extension
        available_files = os.listdir(directory)
        found_file = next((filename for filename in available_files if filename.startswith(f'v{version}')))
        _, extension = path.splitext(found_file)
    else:
        extension = f'.{extension}'

    return f'{directory}/v{version}{extension}'

Helper Functions¶

First, let's write some functions that we'll be reusing throughout our project.

Rate-limiting¶

Let's write a higher-order function that, given an Iterable, will return a Generator which will wait a certain amount of time between yielding elements.

In [3]:
from typing import Iterable, Generator

def delay_iterable(iterable: Iterable, delay_seconds=0.5) -> Generator:
    is_first_element = True
    for element in iterable:
        if not is_first_element:
            sleep(delay_seconds)
        else:
            is_first_element = False

        yield element

Let's write another function for simply delaying repeated function calls.

In [4]:
def delay_fn(f, delay_seconds=0.5):
    is_first_time = True

    def delayed_fn(*args, **kwargs):
        nonlocal f, is_first_time

        if not is_first_time:
            sleep(delay_seconds)
        else:
            is_first_time = False

        return f(*args, **kwargs)

    return delayed_fn

Caching Data¶

We don't want to refetch our data every time we run the notebook, so let's write some helper functions that allow us to cache our data after we download it.

In [5]:
from os import path

def download_cached(url: str, filepath: str):
    # if the file already exists, does nothing.
    # otherwise, downloads to filepath
    if not path.exists(filepath):
        with open(filepath, 'w+') as f:
            f.write(requests.get(url).text)

Normalizing Data¶

This function will normalize a Series between 0 and 1.

In [6]:
def normalize_min_max(series: 'Series') -> 'Series':
    return (series - series.min()) / (series.max() - series.min())

Loading and Exporting Data¶

Let's write a generalized loading function which will determine how to read a file based on the extension. By default, it will get the freshest version of the data by finding the highest version of the data in its data directory. This will allow us to forget the details of how we collected the data and instead focus on the data itself.

In [7]:
import os
from os import path

def latest_data_filepath(key) -> str:
    data_dir = data_directory(key)
    versions = os.listdir(data_dir)

    # return the lexicographically highest one
    return f'{data_dir}/{max(versions)}'

def generic_load(key, version=None) -> object:
    """
    If `version` is specified, gets that specific version of the data.
    """
    file_path = data_filepath(key, version) if version is not None else latest_data_filepath(key)

    # then, fallback to static data
    _, extension = path.splitext(file_path)
    if extension == '.csv':
        return pd.read_csv(file_path)
    elif extension == '.json':
        return pd.read_json(file_path)
    elif extension == '.geoparquet':
        return gpd.read_parquet(file_path)
    elif extension == '.geojson':
        return gpd.read_file(file_path)
    elif extension == '.parquet':
        return pd.read_parquet(file_path)
    else:
        raise Exception(f'Invalid format: {extension}')

Next, let's write a generalized exporting function which will allow us to export our dataframes to the filesystem. For GeoDataFrame types, we specifically use the parquet data format since it is significantly more space-efficient than geojson. The reason we use parquet over csv for regular DataFrames is because when reading from csv, some "special" fields like DateTime's are not read into the proper types.

In [8]:
def generic_export(key, data: 'DataFrame', version):
    extension = 'geoparquet' if isinstance(data, gpd.GeoDataFrame) else 'parquet'
    path = data_filepath(key, version=version, extension=extension)
    data.to_parquet(path)

Google Places API¶

We will need to use the Google Maps Place Search API at several points in our Data Collection for geocoding and categorizing addresses. Let's write a function which will yield the URL and get parameters to make a request to the Places API, given a search query.

In [9]:
# unlike requests, this package allows us to do requests in parallel
find_place_from_text_endpoint = r'https://maps.googleapis.com/maps/api/place/findplacefromtext/json'

def google_find_place_from_text_params(search_query):
    get_params = {
        'input': search_query,
        'inputtype': 'textquery',
        'fields': 'type,geometry,name,place_id',
        'key': os.environ['GOOGLE_API_KEY']
    }

    # url, kwargs
    return (find_place_from_text_endpoint, get_params)

This is a function which will extract a geometry component out of the response to a find place from text request.

In [10]:
from shapely.geometry import Point

def extract_geometry_from_places_request(response):
    try:
        location = response['candidates'][0]['geometry']['location']
        return Point(location['lng'], location['lat'])
    except:
        return None

Now, let's write a function which will run a list of requests in parallel.

In [11]:
from requests_futures.sessions import FuturesSession
from concurrent.futures import as_completed

def run_requests(request_param_list: 'List[Tuple[str, dict]]'):
    session = FuturesSession(max_workers=4)

    futures = []
    for (index, (url, get_params)) in enumerate(request_param_list):
        future = session.get(url, params=get_params)
        future.index = index
        futures.append(future)

    resolved_futures = list(as_completed(futures))

    # sort by index
    resolved_futures.sort(key=lambda f: f.index)

    # return the results
    return list(map(lambda f: f.result().json(), resolved_futures))

Let's also define a helper for mapping a function over the rows of a DataFrame.

In [12]:
def map_df(function, df):
    return (function(row) for row in df.to_dict(orient='records'))

One-time Data Collection¶

Baltimore Community Statistical Areas¶

Let's get the bounding boxes of Baltimore's "Community Statistical Areas", which are "clusters of neighborhoods developed by the City's Planning Department based on recognizable city neighborhoods" (Source).

By using download_cached, we will only download the file if it does not already exist.

In [16]:
download_url = r'https://opendata.arcgis.com/api/v3/datasets/9c96ae20e6cc41258015c2fd288716c4_0/downloads/data?format=geojson&spatialRefId=4326&where=1%3D1'
download_filepath = data_filepath('baltimore-csa', version=0, extension='geojson')

download_cached(download_url, download_filepath)

Let's load the data and see what columns we have to work with.

In [17]:
baltimore_csa_df = generic_load('baltimore-csa')
print(baltimore_csa_df.columns)
Index(['FID', 'Community', 'Neigh', 'Tracts', 'Link', 'geometry'], dtype='object')

Now, let's plot our data, highlighting the different communities.

In [18]:
baltimore_csa_df.plot(column='Community')
Out[18]:
<Axes: >

Crime Data¶

In this section, we are going to use the Baltimore Crime Data dataset, published by the city of Baltimore itself. Let's start off by downloading the data.

In [19]:
download_url = r'https://download.data.world/file_download/baltimore/baltimore-crime-data/BPD_Part_1_Victim_Based_Crime_Data.csv'

download_filepath = data_filepath('crime-data', 0)
download_cached(download_url, download_filepath)

Now, let's look at the data and see how we can clean it up.

In [20]:
crime_df = pd.read_csv(download_filepath)
print(crime_df.head())
    CrimeDate CrimeTime CrimeCode          Location Description   Weapon   
0  06/18/2016  00:33:00        4E  2700 CHESLEY AVE           I    HANDS  \
1  06/18/2016  00:39:00        4B     2700 FAIT AVE           O    KNIFE   
2  06/18/2016      0015        9S   2400 CYLBURN AV     Outside  FIREARM   
3  06/18/2016  01:53:00       3AF   2300 ORLEANS ST           O  FIREARM   
4  06/18/2016  02:05:00        6C    800 N WOLFE ST           I      NaN   

    Post      District        Neighborhood                       Location 1   
0  424.0  NORTHEASTERN  North Harford Road  (39.3679000000, -76.5555900000)  \
1  232.0  SOUTHEASTERN              Canton  (39.2831500000, -76.5783400000)   
2  532.0      NORTHERN           Levindale  (39.3510400000, -76.6597600000)   
3  221.0  SOUTHEASTERN      McElderry Park  (39.2955600000, -76.5844600000)   
4  321.0       EASTERN         Middle East  (39.3002700000, -76.5909700000)   

   Total Incidents  
0                1  
1                1  
2                1  
3                1  
4                1  

First, let's convert CrimeDate and CrimeTime, both currently strings, into a single CrimeDateTime column. Looking at the data, we have certain CrimeTime entries which seem to have the format <hour><minute> instead of the regular <hour>:<minute>:<seconds> format. Before we convert, we have to first normalize the outliers.

In [21]:
bad_time_format = re.compile('^(\d{2})(\d{2})$')
crime_df['CrimeTime'] = crime_df['CrimeTime'].str.replace(bad_time_format, r'\1:\2:00', regex=True)

Let's make sure that all of the rows match our expected regex by printing out all of the times that don't match what we're expecting.

In [22]:
expected_time_format = re.compile('^\d{2}:\d{2}:\d{2}$')
print(crime_df[~crime_df['CrimeTime'].str.fullmatch(expected_time_format)]['CrimeTime'])
Series([], Name: CrimeTime, dtype: object)

The list is empty, so we're good. Now, let's do the DateTime conversion.

In [23]:
crime_time = pd.to_timedelta(crime_df['CrimeTime'])
crime_df['CrimeDateTime'] = pd.to_datetime(crime_df['CrimeDate'], format='%m/%d/%Y') + crime_time
print(crime_df.head())
    CrimeDate CrimeTime CrimeCode          Location Description   Weapon   
0  06/18/2016  00:33:00        4E  2700 CHESLEY AVE           I    HANDS  \
1  06/18/2016  00:39:00        4B     2700 FAIT AVE           O    KNIFE   
2  06/18/2016  00:15:00        9S   2400 CYLBURN AV     Outside  FIREARM   
3  06/18/2016  01:53:00       3AF   2300 ORLEANS ST           O  FIREARM   
4  06/18/2016  02:05:00        6C    800 N WOLFE ST           I      NaN   

    Post      District        Neighborhood                       Location 1   
0  424.0  NORTHEASTERN  North Harford Road  (39.3679000000, -76.5555900000)  \
1  232.0  SOUTHEASTERN              Canton  (39.2831500000, -76.5783400000)   
2  532.0      NORTHERN           Levindale  (39.3510400000, -76.6597600000)   
3  221.0  SOUTHEASTERN      McElderry Park  (39.2955600000, -76.5844600000)   
4  321.0       EASTERN         Middle East  (39.3002700000, -76.5909700000)   

   Total Incidents       CrimeDateTime  
0                1 2016-06-18 00:33:00  
1                1 2016-06-18 00:39:00  
2                1 2016-06-18 00:15:00  
3                1 2016-06-18 01:53:00  
4                1 2016-06-18 02:05:00  

Let's drop any rows which do not contain Location's, since this is our most important feature.

In [24]:
crime_df = crime_df.drop(crime_df[crime_df['Location 1'].isnull()].index)

Let's also convert our Location column into shapely Point objects. Note that because geopandas expectes x to correspond to longitude and y to latitude, we have to reverse the initial order.

In [25]:
geometry = crime_df['Location 1'].str.extract('\((-?\d+\.\d+), (-?\d+\.\d+)\)', expand=True)
crime_df['geometry'] = gpd.points_from_xy(geometry[1].astype('float64'), geometry[0].astype('float64'))
crime_df = gpd.GeoDataFrame(crime_df)

Now, let's look at a Scatterplot of our data.

In [26]:
crime_df.plot()
Out[26]:
<Axes: >

It seems that there is a cluster of points which lie far away from the rest of the points. Let's see what's going on with them.

In [27]:
outliers = crime_df[crime_df['geometry'].y > 41]
print(outliers)
        CrimeDate CrimeTime CrimeCode                  Location Description   
48     06/18/2016  14:00:00        6E          2800 ELLICOTT DY           O  \
579    06/13/2016  04:00:00        5D               800 BOYD ST           I   
726    06/12/2016  13:00:00        6E          4100 HAYWARD AVE           O   
768    06/12/2016  20:39:00        4E         AV & GLENFALLS AV           O   
997    06/10/2016  14:35:00       3AF             ST & TOONE ST           O   
...           ...       ...       ...                       ...         ...   
22353  12/14/2015  11:30:00        5A  3300 LIBERTY HEIGHTS AVE           I   
22541  12/13/2015  23:20:00        4E        3200 O''DONNELL ST           I   
22725  12/11/2015  01:01:00        7A             0 E OSTEND ST           O   
26154  11/15/2015  10:20:00        4A          300 E MADISON ST           I   
32009  10/03/2015  11:23:00        6C        4900 FREDERICK AVE           I   

        Weapon   Post      District    Neighborhood   
48         NaN  814.0  SOUTHWESTERN      Winchester  \
579        NaN  931.0      SOUTHERN  Hollins Market   
726        NaN  634.0  NORTHWESTERN        Woodmere   
768      HANDS  425.0  NORTHEASTERN  Glenham-Belhar   
997    FIREARM  233.0  SOUTHEASTERN         Medford   
...        ...    ...           ...             ...   
22353      NaN  642.0  NORTHWESTERN     Forest Park   
22541    HANDS  232.0  SOUTHEASTERN          Canton   
22725      NaN  942.0      SOUTHERN    Federal Hill   
26154  FIREARM  312.0       EASTERN   Penn-Fallsway   
32009      NaN  833.0  SOUTHWESTERN      Beechfield   

                            Location 1  Total Incidents       CrimeDateTime   
48     (41.5286100000, -76.6537300000)                1 2016-06-18 14:00:00  \
579    (41.5553400000, -76.6178600000)                1 2016-06-13 04:00:00   
726    (41.5102000000, -76.6784100000)                1 2016-06-12 13:00:00   
768    (41.6228300000, -76.5271200000)                1 2016-06-12 20:39:00   
997    (41.6213600000, -76.5291100000)                1 2016-06-10 14:35:00   
...                                ...              ...                 ...   
22353  (41.5228600000, -76.6614400000)                1 2015-12-14 11:30:00   
22541  (41.6016700000, -76.5556000000)                1 2015-12-13 23:20:00   
22725  (41.5677600000, -76.6011900000)                1 2015-12-11 01:01:00   
26154  (41.5707600000, -76.5971500000)                1 2015-11-15 10:20:00   
32009  (41.5043600000, -76.6862300000)                1 2015-10-03 11:23:00   

                         geometry  
48     POINT (-76.65373 41.52861)  
579    POINT (-76.61786 41.55534)  
726    POINT (-76.67841 41.51020)  
768    POINT (-76.52712 41.62283)  
997    POINT (-76.52911 41.62136)  
...                           ...  
22353  POINT (-76.66144 41.52286)  
22541  POINT (-76.55560 41.60167)  
22725  POINT (-76.60119 41.56776)  
26154  POINT (-76.59715 41.57076)  
32009  POINT (-76.68623 41.50436)  

[116 rows x 13 columns]

After observing a couple of the addresses, it seems that the geometry field was input incorrectly, but the addresses are regular. To fix the geometry, we'll use the Google Places API for geocoding.

In [28]:
def find_place_with_address(row):
    return google_find_place_from_text_params(f"{row['Location']}, Baltimore")

geocoded_results = run_requests(map_df(find_place_with_address, outliers))
print(geocoded_results[0])
{'candidates': [{'geometry': {'location': {'lat': 39.2879496, 'lng': -76.6131106}, 'viewport': {'northeast': {'lat': 39.28929942989272, 'lng': -76.61176077010728}, 'southwest': {'lat': 39.28659977010728, 'lng': -76.61446042989272}}}, 'name': 'Ellicott St', 'place_id': 'ChIJxce0HJ4EyIkRIUuN8vhOQWQ', 'types': ['route']}], 'status': 'OK'}

Now, let's merge our results back into the parent DataFrame.

In [29]:
crime_df.loc[outliers.index, 'geometry'] = list(map(extract_geometry_from_places_request, geocoded_results))

Let's see if any of the geolocations failed.

In [30]:
failed_geolocations = crime_df[crime_df['geometry'].isnull()]
print(len(failed_geolocations))
2

We will drop these rows, since they likely correspond to misinputs.

In [31]:
crime_df = crime_df.drop(failed_geolocations.index)

Now, let's plot our data again and make sure it looks correct.

In [32]:
crime_df.plot()
Out[32]:
<Axes: >

This looks a lot more like Baltimore!

Finally, let's drop the initial Date and Time columns as well as the Location column, and write to our dynamic data file. As a short aside, this dataset takes almost 100M as a geojson file, while only taking 7M using the parquet format.

In [33]:
crime_df = crime_df.drop(['CrimeDate', 'CrimeTime', 'Location 1'], axis=1)
generic_export('crime-data', crime_df, version=1)

Crime Codes¶

CrimeCode seems important, but in its current form it's not very useful. To know what each crime code means, let's download the companion CRIME CODES dataset.

In [34]:
download_url = r'https://www.arcgis.com/sharing/rest/content/items/e6ca4eadecdc475a961f68bc314e2a86/data'
download_filepath = data_filepath('crime-codes', version=0)

download_cached(download_url, download_filepath)

Let's see what the data looks like before moving on.

In [35]:
crime_codes_df = pd.read_csv(download_filepath)
print(crime_codes_df.head())
  CODE  TYPE                  NAME   CLASS    NAME_COMBINE   WEAPON   
0   13  CTYP  ASSIST OFFICER           CFS  ASSIST OFFICER      NaN  \
1   1A  CTYP  MURDER                PART 1        HOMICIDE    OTHER   
2   1F  CTYP  MURDER                PART 1        HOMICIDE  FIREARM   
3   1K  CTYP                MURDER  PART 1        HOMICIDE    KNIFE   
4   1O  CTYP                MURDER  PART 1        HOMICIDE    OTHER   

  VIOLENT_CR VIO_PROP_CFS  
0        NaN        OTHER  
1   HOMICIDE      VIOLENT  
2   HOMICIDE      VIOLENT  
3   HOMICIDE      VIOLENT  
4   HOMICIDE      VIOLENT  

Police Stations¶

There are 9 districts in Baltimore, corresponding to the 4 cardinal directions, 4 in-betweens and one central district. To get the locations of the police stations in Baltimore, we will webscrape https://www.baltimorepolice.org/find-my-district, get the addresses of each station, and then use geopy to get the lat/long from each address.

First, let's set a constant for our base URL, and abstract out our directions into lists.

In [36]:
base_url = 'https://www.baltimorepolice.org/find-my-district'

vertical_directions = ['north', 'south']
horizontal_directions = ['east', 'west']

Let's start by setting our central station.

In [37]:
stations = ['central']

Now, let's add in each compass direction, appending an "ern" to the end of each one, i.e "east" becomes "eastern".

In [38]:
for direction in vertical_directions + horizontal_directions:
    stations.append(f'{direction}ern')

Next, we'll add the compound directions, which are formed by joining a vertical and horizontal direction, followed by "ern" like before.

In [39]:
for vertical in vertical_directions:
    for horizontal in horizontal_directions:
        stations.append(f'{vertical}{horizontal}ern')

Now that we have a list of all of our stations, let's make a dictionary mapping each station to its address. First, let's write a function that will lookup the address of a single station.

In [40]:
address_pattern = re.compile(r'Address: (.+)')

def police_lookup_address(station: str) -> str:
    r = requests.get(f'{base_url}/{station}-district')
    soup = BeautifulSoup(r.text)
    combined_text = soup.get_text()
    search_result = address_pattern.search(combined_text)

    # return the first capture group
    return search_result.group(1)

Now, let's make a DataFrame for our stations.

In [41]:
stations_df = pd.DataFrame.from_dict({'station': stations})

Let's add a row for the address of each station.

In [42]:
stations_df['address'] = stations_df.apply(delay_fn(lambda row: police_lookup_address(row.station)), axis=1)
print(stations_df)
        station                                          address
0       central        501 N. Calvert Street Baltimore, MD 21202
1      northern  2201 W. Cold Spring Lane, Baltimore, Md., 21215
2      southern         10 Cherry Hill Road, Baltimore, MD 21225
3       eastern         1620 Edison Highway, Baltimore, MD 21213
4       western        1034 N. Mount Street, Baltimore, MD 21217
5  northeastern          1900 Argonne Drive, Baltimore, MD 21218
6  northwestern      5271 Reisterstown Road, Baltimore, MD 21215
7  southeastern         5710 Eastern Avenue, Baltimore, MD 21224
8  southwestern        424 Font Hill Avenue, Baltimore, MD 21223

Next, let's use geopandas to convert each one of those addresses into a latitude and longitude.

In [43]:
stations_geocoded = gpd.tools.geocode(stations_df.address)
print(stations_geocoded)
                     geometry   
0  POINT (-76.61218 39.29542)  \
1  POINT (-76.57794 39.34498)   
2  POINT (-76.61718 39.25287)   
3  POINT (-76.57335 39.31002)   
4  POINT (-76.64474 39.30067)   
5  POINT (-76.58289 39.34082)   
6  POINT (-76.68513 39.34467)   
7  POINT (-76.58799 39.21552)   
8  POINT (-76.65676 39.28031)   

                                             address  
0  Central District Baltimore Police Department, ...  
1  2201, East Cold Spring Lane, 21214, East Cold ...  
2  Southern District, 10, Cherry Hill Road, 21225...  
3  Baltimore Police Department Eastern District, ...  
4  Western District Police Station, 1034, North M...  
5  Baltimore Police Department Northeast District...  
6  Northwest Police Station, 5271, Reisterstown R...  
7  5710, Pennington Avenue, 21226, Pennington Ave...  
8  424, Millington Avenue, 21223, Millington Aven...  

We don't need the station column anymore, and the geocoded address is superior (more detailed) to the original, so we will replace the initial dataframe with the new one entirely.

In [44]:
stations_df = stations_geocoded

Finally, let's write our data to the file specified in the SPEC.

In [45]:
generic_export('police-stations', stations_df, 1)

Population Density¶

For each district, we will have a coefficient representing population density.

First, let's get the Total Population dataset from Open Baltimore.

In [46]:
download_url = r'https://opendata.arcgis.com/api/v3/datasets/56d5b4e5480049e98315c2732aa48437_0/downloads/data?format=geojson&spatialRefId=4326&where=1%3D1'

download_filepath = data_filepath('population-density', version=0)
download_cached(download_url, download_filepath)

Now, let's read it into a DataFrame.

In [47]:
populations_df = generic_load('population-density', version=0)
print(populations_df.head())
   OBJECTID                            CSA2010  tpop10  tpop20   Shape__Area   
0         1      Allendale/Irvington/S. Hilton   16217   14828  6.377046e+07  \
1         2    Beechfield/Ten Hills/West Hills   12264   12083  4.788253e+07   
2         3                      Belair-Edison   17416   15085  4.495003e+07   
3         4  Brooklyn/Curtis Bay/Hawkins Point   14243   13479  1.760777e+08   
4         5                             Canton    8100    8239  1.540854e+07   

   Shape__Length                                           geometry  
0   38770.165571  POLYGON ((-76.65726 39.27600, -76.65726 39.276...  
1   37524.950533  POLYGON ((-76.69479 39.30201, -76.69465 39.301...  
2   31307.314843  POLYGON ((-76.56761 39.32636, -76.56746 39.326...  
3  150987.703639  MULTIPOLYGON (((-76.58867 39.21283, -76.58824 ...  
4   23338.611948  POLYGON ((-76.57140 39.28441, -76.57138 39.284...  

We will use the average of the population value from 2010 and 2020.

In [48]:
populations_df = populations_df.assign(density=lambda df: (df['tpop10'] + df['tpop20']) / df['Shape__Area'])

Let's normalize the density values between -1 and 1, since the actual values themselves are less important than the values with relation to one another. We are using a variant of min-max normalization that puts values between -1 and 1 rather than 0 and 1.

In [49]:
populations_df['density'] = normalize_min_max(populations_df['density'])

Let's now drop our unneeded tpop columns.

In [50]:
populations_df = populations_df.drop(['tpop10', 'tpop20'], axis=1)

Finally, let's export our dataframe.

In [51]:
generic_export('population-density', populations_df, version=1)

Median Income¶

For Median Household Income, we will use the Median Household Income dataset from Open Baltimore.

In [52]:
download_url = r'https://opendata.arcgis.com/api/v3/datasets/8613366cfbc7447a9efd9123604c65c1_0/downloads/data?format=geojson&spatialRefId=4326&where=1%3D1'

download_filepath = data_filepath('median-income', version=0)
download_cached(download_url, download_filepath)

Now, let's read it into a DataFrame.

In [53]:
median_income_df = generic_load('median-income', version=0)
print(median_income_df.head())
   OBJECTID                            CSA2010    mhhi10        mhhi11   
0         1      Allendale/Irvington/S. Hilton  33563.12  33504.324121  \
1         2    Beechfield/Ten Hills/West Hills  50780.92  50439.739513   
2         3                      Belair-Edison  42920.83  45149.372510   
3         4  Brooklyn/Curtis Bay/Hawkins Point  32888.50  33644.052189   
4         5                             Canton  74685.14  82129.569175   

         mhhi12        mhhi13        mhhi14        mhhi15         mhhi16   
0  33177.658915  38129.073308  35958.253351  36701.906742   37302.171053  \
1  50135.121622  49807.861765  52622.688525  51537.582075   53565.079698   
2  46743.281847  43903.901338  38905.924198  38173.968254   40482.359649   
3  33526.364650  34419.965251  35861.896552  36679.053435   38603.930233   
4  84978.141631  90862.712924  91735.652736  95362.400309  103281.832192   

          mhhi17         mhhi18        mhhi19   
0   39495.628472   38535.562176   43019.75792  \
1   57572.502747   58055.306613   55017.77971   
2   39624.482085   42633.619512   46703.93468   
3   40275.275330   39936.512500   39162.13858   
4  111891.251825  116911.088235  128460.48210   

                             CSA2020        mhhi20        mhhi21   
0      Allendale/Irvington/S. Hilton   42714.60327   42714.60327  \
1    Beechfield/Ten Hills/West Hills   55435.50291   55435.50291   
2                      Belair-Edison   50010.43737   50010.43737   
3  Brooklyn/Curtis Bay/Hawkins Point   32598.81788   32598.81788   
4                             Canton  134208.76110  134208.76110   

    Shape__Area  Shape__Length   
0  6.377046e+07   38770.165571  \
1  4.788253e+07   37524.950533   
2  4.495003e+07   31307.314843   
3  1.760777e+08  150987.703639   
4  1.540854e+07   23338.611948   

                                            geometry  
0  POLYGON ((-76.65726 39.27600, -76.65726 39.276...  
1  POLYGON ((-76.69479 39.30201, -76.69465 39.301...  
2  POLYGON ((-76.56761 39.32636, -76.56746 39.326...  
3  MULTIPOLYGON (((-76.58867 39.21283, -76.58824 ...  
4  POLYGON ((-76.57140 39.28441, -76.57138 39.284...  

It gives us median household income from 2010 until 2021. For consistency, we will choose the middle year: 2016 as our standard.

In [54]:
median_income_df = median_income_df[['CSA2010', 'mhhi16', 'Shape__Area', 'Shape__Length', 'geometry']].rename(columns={'mhhi16': 'MedianIncome'})

Let's now normalize our MedianIncome column.

In [55]:
median_income_df['MedianIncome'] = normalize_min_max(median_income_df['MedianIncome'])

Finally, let's export the result.

In [56]:
generic_export('median-income', median_income_df, version=1)

Finding Liquor Stores, Bars and Strip Clubs¶

To get the list of liquor stores, bars and strip clubs, we will use the Liquor Licenses | Open Baltimore dataset. Note that we are operating on the assumption that all strip clubs have liquor licenses, but logically this seems like a fair assumption. Unfortunately, there is something wrong with the data export on the server side, so rather than simply downloading the data and filtering it locally, we need to use the SQL query interface to get the data.

In [57]:
query_endpoint = r'https://opendata.baltimorecity.gov/egis/rest/services/NonSpatialTables/Licenses/FeatureServer/0/query'

There are a couple things that will go into our query.

  1. We want establishments whose licenses were active during the period covered by our crime data. Let's find the latest date in our crime data.

    code
    # reload the data from the filesystem in case we haven't run the previous cells
    crime_df = generic_load('crime-data')
    print(crime_df.CrimeDateTime.max())
    

    Thus, we will filter for license end dates after 06/18/2016, since that is when our crime data ends.

  2. We only want places that are liquor stores, bars or strip clubs; not restaurants or anything else. After querying the endpoint, I found the following list of EstablishmentDesc categories, which label what kind of place each license is for.

    1. Non-Profits only
    2. Restaurant License
    3. Municipal Golf Course
    4. Adult
    5. Distillery
    6. Tavern License
    7. Brewery
    8. Hotel/Motel
    9. Racetrack
    10. Package Goods Only
    11. Tavern
    12. Arena
    13. Restaurant
    14. Zoo License

    Of these options, we are interested in Tavern, which represents bars; Package goods only, which represents liquor stores; and Adult, which is code for strip clubs. Surprisingly, Tavern License primarily belongs to restaurants, so we do not group it in with Tavern.

As described in the documentation, there is a limit on how many data entries we can return at once. However, if we enable the returnIdsOnly parameter, this limit does not apply, and we can get all of the ids at once.

In [58]:
query_params = {
    'where': f"LicenseEndDate > DATE '{crime_df.CrimeDateTime.max()}' AND EstablishmentDesc IN ('Tavern', 'Adult', 'Package goods only')",
    'returnIdsOnly': 'true',
    'f': 'geojson'
}

Getting the Relevant IDs¶

Unfortunately, making the request yields the error specified here: python - SSL error unsafe legacy renegotiation disabled - Stack Overflow, caused by a bad SSL setup on the server side. This is out of our control, and also out of the scope of this project, so I'm copy-pasting one of the answers from the post in order to make the request. Please pretend this code doesn't exist.

In [59]:
import urllib3
import ssl


class CustomHttpAdapter (requests.adapters.HTTPAdapter):
    # "Transport adapter" that allows us to use custom ssl_context.

    def __init__(self, ssl_context=None, **kwargs):
        self.ssl_context = ssl_context
        super().__init__(**kwargs)

    def init_poolmanager(self, connections, maxsize, block=False):
        self.poolmanager = urllib3.poolmanager.PoolManager(
            num_pools=connections, maxsize=maxsize,
            block=block, ssl_context=self.ssl_context)


def get_legacy_session():
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
    ctx.options |= 0x4  # OP_LEGACY_SERVER_CONNECT
    session = requests.session()
    session.mount('https://', CustomHttpAdapter(ctx))
    return session

def legacy_request_get(*args, **kwargs):
    return get_legacy_session().get(*args, **kwargs)

Now that that's out of the way, let's make our request to get the list of IDs.

In [60]:
ids = legacy_request_get(query_endpoint, params=query_params).json()['properties']['objectIds']
print(ids[:5])
[1, 2, 3, 4, 5]

Paginating the IDs¶

Because there's a limit on getting the data (apart from the IDs), we need to paginate our requests so that we get a certain number of entries at a time. From my testing, it appears that we can fetch 250 entries at once, so we will use that number going forward.
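
The pagination described above reduces to slicing the ID list into fixed-size chunks; as a standalone sketch (chunk_ids is an illustrative helper, not part of the notebook):

```python
def chunk_ids(ids, size):
    # Yield consecutive slices of at most `size` IDs; the final chunk
    # may be shorter, which the endpoint accepts without complaint.
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

print(list(chunk_ids(list(range(7)), 3)))  # → [[0, 1, 2], [3, 4, 5], [6]]
```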

Let's disable where and returnIdsOnly since we now want the records from the IDs.

In [61]:
entries_query_params = query_params.copy()
entries_query_params['where'] = None
entries_query_params['returnIdsOnly'] = 'false'
entries_query_params['outFields'] = ','.join(['TradeName', 'EstablishmentDesc', 'AddrStreet', 'AddrZip'])

Let's run our scraping code.

In [62]:
liquor_chunks = []

entries_per_request = 250
for start_index in range(0, len(ids), entries_per_request):
    # slicing past the end of the list is safe, so no explicit bounds check is needed
    id_chunk = ids[start_index:start_index + entries_per_request]
    entries_query_params['objectIds'] = ','.join(map(str, id_chunk))

    request_res = legacy_request_get(query_endpoint, params=entries_query_params).json()
    liquor_chunks.append(gpd.GeoDataFrame.from_features(request_res))

    # brief pause to avoid hammering the endpoint
    sleep(0.5)

# a single concat at the end avoids repeatedly copying the accumulated frame
liquor_df = pd.concat(liquor_chunks)

print(liquor_df)
    geometry                    TradeName EstablishmentDesc   
0       None  DIVING HORSE GENTLEMAN CLUB             Adult  \
1       None  DIVING HORSE GENTLEMAN CLUB             Adult   
2       None  DIVING HORSE GENTLEMAN CLUB             Adult   
3       None                     RED ROOM             Adult   
4       None                      GODDESS             Adult   
..       ...                          ...               ...   
209     None           TOM & OLIVE'S CAFE            Tavern   
210     None           TOM & OLIVE'S CAFE            Tavern   
211     None                  THE HANOVER            Tavern   
212     None                  THE HANOVER            Tavern   
213     None                  THE HANOVER            Tavern   

                       AddrStreet AddrZip  
0            5-11 COMMERCE STREET   21202  
1            5-11 COMMERCE STREET   21202  
2            5-11 COMMERCE STREET   21202  
3       411 BALTIMORE STREET EAST   21202  
4        36-38 EUTAW STREET SOUTH   21201  
..                            ...     ...  
209     2942 MONUMENT STREET EAST   21205  
210     2942 MONUMENT STREET EAST   21205  
211  3432-34 HANOVER STREET SOUTH   21225  
212  3432-34 HANOVER STREET SOUTH   21225  
213  3432-34 HANOVER STREET SOUTH   21225  

[7714 rows x 5 columns]

Let's reset the index since it's currently numbered strangely.

In [63]:
liquor_df = liquor_df.reset_index(drop=True)

Now that we have that data, let's write it as version 0 of our dataset.

In [64]:
generic_export('liquor-licenses', liquor_df, version=0)
/home/sridaran/Development/Python/Environments/Data/lib/python3.11/site-packages/geopandas/array.py:968: RuntimeWarning: All-NaN slice encountered
  np.nanmin(b[:, 0]),  # minx
/home/sridaran/Development/Python/Environments/Data/lib/python3.11/site-packages/geopandas/array.py:969: RuntimeWarning: All-NaN slice encountered
  np.nanmin(b[:, 1]),  # miny
/home/sridaran/Development/Python/Environments/Data/lib/python3.11/site-packages/geopandas/array.py:970: RuntimeWarning: All-NaN slice encountered
  np.nanmax(b[:, 2]),  # maxx
/home/sridaran/Development/Python/Environments/Data/lib/python3.11/site-packages/geopandas/array.py:971: RuntimeWarning: All-NaN slice encountered
  np.nanmax(b[:, 3]),  # maxy

We have 7714 rows, but unfortunately the endpoint did not give us our geometry. Therefore, we have to use geocoding once again.

From experimentation, it seems that geocoding fails for addresses that contain dashes in the numeric element, like 5-11 Foobar Street.
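
To illustrate the failure mode and its fix: collapsing a dashed house-number range down to its first number behaves like this (the same pattern is compiled in the next cell):

```python
import re

# Match a leading dashed house-number range like "5-11" and keep the first number.
pattern = re.compile(r'^(\d+)-\d+')

print(pattern.sub(r'\1', '5-11 COMMERCE STREET'))       # → 5 COMMERCE STREET
print(pattern.sub(r'\1', '411 BALTIMORE STREET EAST'))  # unchanged: no dashed range
```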

In [65]:
relevant_regex = re.compile(r'^(\d+)-\d+')
print(liquor_df[liquor_df.AddrStreet.str.match(relevant_regex)])
     geometry                    TradeName   EstablishmentDesc   
0        None  DIVING HORSE GENTLEMAN CLUB               Adult  \
1        None  DIVING HORSE GENTLEMAN CLUB               Adult   
2        None  DIVING HORSE GENTLEMAN CLUB               Adult   
4        None                      GODDESS               Adult   
5        None     WAGON WHEEL/PLAYERS CLUB               Adult   
...       ...                          ...                 ...   
7706     None         YESTERYEAR'S LIQUORS  Package goods only   
7707     None         YESTERYEAR'S LIQUORS  Package goods only   
7711     None                  THE HANOVER              Tavern   
7712     None                  THE HANOVER              Tavern   
7713     None                  THE HANOVER              Tavern   

                         AddrStreet AddrZip  
0              5-11 COMMERCE STREET   21202  
1              5-11 COMMERCE STREET   21202  
2              5-11 COMMERCE STREET   21202  
4          36-38 EUTAW STREET SOUTH   21201  
5           821-23 NORTH POINT ROAD   21237  
...                             ...     ...  
7706  212-214 HIGHLAND AVENUE NORTH   21224  
7707  212-214 HIGHLAND AVENUE NORTH   21224  
7711   3432-34 HANOVER STREET SOUTH   21225  
7712   3432-34 HANOVER STREET SOUTH   21225  
7713   3432-34 HANOVER STREET SOUTH   21225  

[1948 rows x 5 columns]

To fix this, we'll remove the second dash component in all of these rows.

In [66]:
liquor_df['AddrStreet'] = liquor_df['AddrStreet'].replace(relevant_regex, r'\1')
print(liquor_df[liquor_df.AddrStreet.str.match(relevant_regex)])
Empty GeoDataFrame
Columns: [geometry, TradeName, EstablishmentDesc, AddrStreet, AddrZip]
Index: []

Next, let's filter out all of the duplicate entries, since for many of the establishments we have several entries corresponding to separate liquor license renewals. First, we'll delete all completely duplicate rows.

In [67]:
liquor_df = liquor_df.drop_duplicates()
print(liquor_df)
     geometry                    TradeName   EstablishmentDesc   
0        None  DIVING HORSE GENTLEMAN CLUB               Adult  \
3        None                     RED ROOM               Adult   
4        None                      GODDESS               Adult   
5        None     WAGON WHEEL/PLAYERS CLUB               Adult   
8        None                   CIRCUS BAR               Adult   
...       ...                          ...                 ...   
7546     None             SIX PAX "N" MORE  Package goods only   
7644     None      ACE LIQUORS & GROCERIES  Package goods only   
7671     None                    KWIK MART              Tavern   
7673     None         FIREFLY FARMS MARKET  Package goods only   
7700     None                HAMPDEN YARDS              Tavern   

                            AddrStreet AddrZip  
0                    5 COMMERCE STREET   21202  
3            411 BALTIMORE STREET EAST   21202  
4                36 EUTAW STREET SOUTH   21201  
5                 821 NORTH POINT ROAD   21237  
8            427 BALTIMORE STREET EAST   21202  
...                                ...     ...  
7546                  6622 BELAIR ROAD   21206  
7644                  500 MAUDE AVENUE   21225  
7671                  6656 BELAIR ROAD   21206  
7673  3300 CLIPPER MILL ROAD, STALL #7   21201  
7700             1014 36TH STREET WEST   21211  

[858 rows x 5 columns]

As we can see, the number of rows decreased significantly.

Filtering out mislabelled establishments¶

All of the entries marked Adult are what we expected, so we won't have to verify those. Unfortunately, some of the entries marked as Tavern and Package goods only are actually restaurants or grocery stores.

  1. For Tavern, some of the listed places do not even seem to sell alcohol, so these definitely have to be filtered out.
  2. In the case of the grocery stores, they do appear to sell liquor. The Hopkins article that I cited earlier argued that liquor "outlets" led to increased crime, and so long as these grocery stores sell alcohol, they fall under "outlets" as well. However, I would still like to distinguish between traditional liquor stores and grocery stores, so we will still look into these.

To classify the businesses, we will have to use the Google Maps Place Search API to categorize each of these as bar, liquor_store or neither.

In order to use the API, we need to have the name of the business.

In [68]:
stores_without_names = liquor_df[liquor_df['TradeName'] == 'N/A']
print(stores_without_names)
     geometry TradeName   EstablishmentDesc                 AddrStreet AddrZip
2329     None       N/A              Tavern        4017 EASTERN AVENUE   21224
3189     None       N/A              Tavern  1001 CHARLES STREET NORTH   21201
3412     None       N/A              Tavern         2322 BOSTON STREET   21224
3466     None       N/A              Tavern   4100 LOMBARD STREET EAST   21224
3480     None       N/A              Tavern  3432 HANOVER STREET SOUTH   21225
3502     None       N/A              Tavern         4600 CURTIS AVENUE   21226
3526     None       N/A              Tavern           5307 BELAIR ROAD   21206
3580     None       N/A               Adult   4100 LOMBARD STREET EAST   21224
4532     None       N/A              Tavern          1739 LIGHT STREET   21230
5024     None       N/A              Tavern   623 LUZERNE AVENUE SOUTH   21224
5267     None       N/A              Tavern          5517 HARFORD ROAD   21214
5366     None       N/A  Package goods only      249 CHASE STREET WEST   21201
5481     None       N/A              Tavern         709 BROADWAY SOUTH   21231
5797     None       N/A              Tavern      2903 O'DONNELL STREET   21224
6165     None       N/A              Tavern             5407 YORK ROAD   21212
6648     None       N/A              Tavern         2218 BOSTON STREET   21231

Let's drop these rows that don't have names, since they likely correspond to misinputs.

In [69]:
liquor_df = liquor_df[liquor_df['TradeName'] != 'N/A'].reset_index(drop=True)

Now, let's look up the details for the remaining rows. We'll write a function which will construct a search query from a row in the format: <business name> <zip code>, and plug that into our previous function. In my experimentation, this form of query works with the highest accuracy.

In [70]:
def find_place_with_tradename_zip_code(row):
    search_query = f"{row['TradeName']} {row['AddrZip']}"

    return google_find_place_from_text_params(search_query)

Finally, let's use our functions to plug all of our rows into the API.

In [75]:
from json import loads, dumps

results1 = run_requests(map_df(find_place_with_tradename_zip_code, liquor_df))
print(dumps(results1[:2], indent=2))
[
  {
    "candidates": [],
    "status": "ZERO_RESULTS"
  },
  {
    "candidates": [
      {
        "geometry": {
          "location": {
            "lat": 39.2895378,
            "lng": -76.6097109
          },
          "viewport": {
            "northeast": {
              "lat": 39.29095042989272,
              "lng": -76.60837092010728
            },
            "southwest": {
              "lat": 39.28825077010728,
              "lng": -76.61107057989273
            }
          }
        },
        "name": "The Red Room Cabaret",
        "place_id": "ChIJre21q50EyIkRMAAFZPZjnXU",
        "types": [
          "point_of_interest",
          "establishment"
        ]
      }
    ],
    "status": "OK"
  }
]

The very first search resulted in ZERO_RESULTS. Let's see how many of them there were.

In [77]:
zero_results = list(filter(lambda j: j['status'] == 'ZERO_RESULTS', results1))
print(len(zero_results))
57

These entries likely correspond to businesses that have closed, so we will ignore them.

Next, let's annotate liquor_df with whether each row corresponds to a bar or a liquor_store. First, let's add a column holding the raw response from the Google Places API.

In [78]:
liquor_df['google_response'] = results1

Before we move on, let's save another version of our data.

In [79]:
generic_export('liquor-licenses', liquor_df, version=1)
/home/sridaran/Development/Python/Environments/Data/lib/python3.11/site-packages/geopandas/array.py:968: RuntimeWarning: All-NaN slice encountered
  np.nanmin(b[:, 0]),  # minx
/home/sridaran/Development/Python/Environments/Data/lib/python3.11/site-packages/geopandas/array.py:969: RuntimeWarning: All-NaN slice encountered
  np.nanmin(b[:, 1]),  # miny
/home/sridaran/Development/Python/Environments/Data/lib/python3.11/site-packages/geopandas/array.py:970: RuntimeWarning: All-NaN slice encountered
  np.nanmax(b[:, 2]),  # maxx
/home/sridaran/Development/Python/Environments/Data/lib/python3.11/site-packages/geopandas/array.py:971: RuntimeWarning: All-NaN slice encountered
  np.nanmax(b[:, 3]),  # maxy

Now, let's define some functions to determine whether a row corresponds to a given category. We want to automatically classify places that have certain words in their name, to account for locations the Places API has no record of. We will say that a row corresponds to category X if the Places API reports the matching type, or if the name contains one of the magic keywords.

In [80]:
def test_string_contains_any(string, expected_substrings) -> bool:
    # Case-insensitive check for whether any expected substring appears in string.
    lowercase_string = string.lower()
    return any(map(lambda sub: (sub.lower() in lowercase_string), expected_substrings))

def make_is_type_column(row, wanted_type, automatic_keywords):
    try:
        first_candidate = row['google_response']['candidates'][0]
        res = (wanted_type in first_candidate['types']) or test_string_contains_any(first_candidate['name'], automatic_keywords)
    except (KeyError, IndexError, TypeError):
        # no candidates (e.g. ZERO_RESULTS) or a malformed response
        res = False

    return res or test_string_contains_any(row['TradeName'], automatic_keywords)

We will say that a bar must either have the bar type, OR have bar in its name. A liquor store must either have the liquor_store Places API type, or one of the magic keywords liquor, wine or spirits in its name.
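
As a quick sanity check of the keyword half of that rule, here is the substring test re-declared standalone (contains_any mirrors test_string_contains_any above):

```python
def contains_any(name, keywords):
    # Case-insensitive substring check against the magic keyword list.
    lowered = name.lower()
    return any(kw in lowered for kw in keywords)

print(contains_any("KNIGHT'S LIQUORS", ['liquor', 'wine', 'spirits']))  # → True
print(contains_any('CAFE CAMPLI', ['bar']))                             # → False
```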

In [81]:
liquor_df['is_bar'] = liquor_df.apply(make_is_type_column, axis=1, args=('bar', ['bar'])).astype('bool')
liquor_df['is_liquor_store'] = liquor_df.apply(make_is_type_column, axis=1, args=('liquor_store', ['liquor', 'wine', 'spirits'])).astype('bool')

Now, let's see which Taverns don't actually correspond to bars.

In [82]:
is_fake_tavern = (liquor_df['EstablishmentDesc'] == 'Tavern') & (~liquor_df['is_bar'])
fake_taverns = liquor_df[is_fake_tavern]
print(fake_taverns)
    geometry                    TradeName EstablishmentDesc   
31      None                 CLUB LUZERNE            Tavern  \
35      None             KNIGHT'S LIQUORS            Tavern   
46      None               SHERRY'S PLACE            Tavern   
47      None                      YOUNG'S            Tavern   
50      None                    JEWEL BOX            Tavern   
..       ...                          ...               ...   
831     None                 PLAKA TAVERN            Tavern   
834     None                     RED DOOR            Tavern   
835     None  E.A.T. (EGGROLLS AND TACOS)            Tavern   
836     None                  CAFE CAMPLI            Tavern   
839     None                    KWIK MART            Tavern   

                      AddrStreet AddrZip   
31      542 LUZERNE AVENUE NORTH   21205  \
35      5139 PARK HEIGHTS AVENUE   21215   
46            238 BROADWAY SOUTH   21231   
47        2401 CHASE STREET EAST   21213   
50     419 BALTIMORE STREET EAST   21202   
..                           ...     ...   
831          4714 EASTERN AVENUE   21224   
834      625 GLOVER STREET NORTH   21205   
835            1371 ANDRE STREET   21230   
836  4801 HARFORD ROAD, SUITE S2   21214   
839             6656 BELAIR ROAD   21206   

                                       google_response  is_bar   
31   {'candidates': [{'geometry': {'location': {'la...   False  \
35   {'candidates': [{'geometry': {'location': {'la...   False   
46        {'candidates': [], 'status': 'ZERO_RESULTS'}   False   
47        {'candidates': [], 'status': 'ZERO_RESULTS'}   False   
50   {'candidates': [{'geometry': {'location': {'la...   False   
..                                                 ...     ...   
831       {'candidates': [], 'status': 'ZERO_RESULTS'}   False   
834  {'candidates': [{'geometry': {'location': {'la...   False   
835  {'candidates': [{'geometry': {'location': {'la...   False   
836  {'candidates': [{'geometry': {'location': {'la...   False   
839  {'candidates': [{'geometry': {'location': {'la...   False   

     is_liquor_store  
31             False  
35              True  
46             False  
47             False  
50             False  
..               ...  
831            False  
834             True  
835            False  
836            False  
839            False  

[205 rows x 8 columns]

Next, fake liquor stores.

In [84]:
is_fake_liquor_store = (liquor_df['EstablishmentDesc'] == 'Package goods only') & (~liquor_df['is_liquor_store'])
fake_liquor_stores = liquor_df[is_fake_liquor_store]
print(fake_liquor_stores.head())
   geometry                 TradeName   EstablishmentDesc   
24     None           TRINACRIA FOODS  Package goods only  \
29     None             KING'S KORNER  Package goods only   
30     None        CVS/PHARMACY #4107  Package goods only   
69     None  JENIS 7 STAR FOOD MARKET  Package goods only   
71     None          SECURED CREDITOR  Package goods only   

                  AddrStreet AddrZip   
24     406 PACA STREET NORTH   21201  \
29  1713 FEDERAL STREET EAST   21213   
30    6828 REISTERSTOWN ROAD   21215   
69    4220 PENNINGTON AVENUE   21226   
71          2300 OREM AVENUE   21217   

                                      google_response  is_bar  is_liquor_store  
24  {'candidates': [{'geometry': {'location': {'la...   False            False  
29  {'candidates': [{'geometry': {'location': {'la...   False            False  
30  {'candidates': [{'geometry': {'location': {'la...   False            False  
69  {'candidates': [{'geometry': {'location': {'la...   False            False  
71  {'candidates': [{'geometry': {'location': {'la...   False            False  

Next, let's drop any rows whose actual category doesn't match their license type.

In [85]:
liquor_df = liquor_df.drop(liquor_df[is_fake_tavern | is_fake_liquor_store].index)
print(liquor_df)
    geometry                    TradeName   EstablishmentDesc   
0       None  DIVING HORSE GENTLEMAN CLUB               Adult  \
1       None                     RED ROOM               Adult   
2       None                      GODDESS               Adult   
3       None     WAGON WHEEL/PLAYERS CLUB               Adult   
4       None                   CIRCUS BAR               Adult   
..       ...                          ...                 ...   
832     None        HAMILTON PARK LIQUORS  Package goods only   
833     None           WONDERLAND LIQUORS              Tavern   
837     None             SIX PAX "N" MORE  Package goods only   
838     None      ACE LIQUORS & GROCERIES  Package goods only   
841     None                HAMPDEN YARDS              Tavern   

                     AddrStreet AddrZip   
0             5 COMMERCE STREET   21202  \
1     411 BALTIMORE STREET EAST   21202   
2         36 EUTAW STREET SOUTH   21201   
3          821 NORTH POINT ROAD   21237   
4     427 BALTIMORE STREET EAST   21202   
..                          ...     ...   
832  2323 NORTHERN PARKWAY EAST   21214   
833    2043 PENNSYLVANIA AVENUE   21217   
837            6622 BELAIR ROAD   21206   
838            500 MAUDE AVENUE   21225   
841       1014 36TH STREET WEST   21211   

                                       google_response  is_bar   
0         {'candidates': [], 'status': 'ZERO_RESULTS'}   False  \
1    {'candidates': [{'geometry': {'location': {'la...    True   
2    {'candidates': [{'geometry': {'location': {'la...   False   
3    {'candidates': [{'geometry': {'location': {'la...   False   
4    {'candidates': [{'geometry': {'location': {'la...    True   
..                                                 ...     ...   
832  {'candidates': [{'geometry': {'location': {'la...   False   
833  {'candidates': [{'geometry': {'location': {'la...    True   
837  {'candidates': [{'geometry': {'location': {'la...   False   
838  {'candidates': [{'geometry': {'location': {'la...   False   
841  {'candidates': [{'geometry': {'location': {'la...    True   

     is_liquor_store  
0              False  
1              False  
2              False  
3              False  
4              False  
..               ...  
832             True  
833             True  
837             True  
838             True  
841            False  

[583 rows x 8 columns]

Extracting Geometry¶

In [86]:
liquor_df['geometry'] = liquor_df['google_response'].apply(extract_geometry_from_places_request)

Let's see how many rows don't have geometry now.

In [87]:
liquor_without_geometry = liquor_df[liquor_df['geometry'].isnull()].copy(deep=True)
print(liquor_without_geometry)
    geometry                            TradeName   EstablishmentDesc   
0       None          DIVING HORSE GENTLEMAN CLUB               Adult  \
108     None                BACCHUS BAR & LIQUORS              Tavern   
123     None                BULL ROBINSON LIQUORS  Package goods only   
184     None                        REGAL LIQUORS  Package goods only   
244     None                     SHUSTER'S LIQUOR  Package goods only   
402     None  MADELINE DELI GROCERY BEER AND WINE  Package goods only   
502     None   BAYVIEW LIQUORS SPORTS BAR& LOUNGE              Tavern   
588     None                      PERRY'S LIQUORS  Package goods only   
591     None           IRVINGTON CUT RATE LIQUORS  Package goods only   
726     None                       GAYETY LIQUORS  Package goods only   
817     None           SEA TOP GROCERY AND LIQUOR  Package goods only   

                     AddrStreet AddrZip   
0             5 COMMERCE STREET   21202  \
108      1220 NORTH AVENUE WEST   21217   
123      2736 GREENMOUNT AVENUE   21218   
184     900 GILMOR STREET NORTH   21217   
244  1231 BALTIMORE STREET WEST   21223   
402     401 FURROW STREET SOUTH   21223   
502            3804 EASTERN AVE   21224   
588       2550 EDMONDSON AVENUE   21223   
591       4100 FREDERICK AVENUE   21229   
726   412 BALTIMORE STREET EAST   21202   
817      4420 PENNINGTON AVENUE   21226   

                                  google_response  is_bar  is_liquor_store  
0    {'candidates': [], 'status': 'ZERO_RESULTS'}   False            False  
108  {'candidates': [], 'status': 'ZERO_RESULTS'}    True             True  
123  {'candidates': [], 'status': 'ZERO_RESULTS'}   False             True  
184  {'candidates': [], 'status': 'ZERO_RESULTS'}   False             True  
244  {'candidates': [], 'status': 'ZERO_RESULTS'}   False             True  
402  {'candidates': [], 'status': 'ZERO_RESULTS'}   False             True  
502  {'candidates': [], 'status': 'ZERO_RESULTS'}    True             True  
588  {'candidates': [], 'status': 'ZERO_RESULTS'}   False             True  
591  {'candidates': [], 'status': 'ZERO_RESULTS'}   False             True  
726  {'candidates': [], 'status': 'ZERO_RESULTS'}   False             True  
817  {'candidates': [], 'status': 'ZERO_RESULTS'}   False             True  

For the remaining rows, let's geocode slightly differently, by plugging the address into the Places API rather than the name of the business and zip code. First, let's construct the Address column by combining street and zip code.

In [88]:
liquor_without_geometry['Address'] = liquor_without_geometry['AddrStreet'] + ' ' + liquor_without_geometry['AddrZip']

Next, let's plug this column into the API to get our locations.

In [89]:
def get_premise(row):
    search_query = row['Address']
    return google_find_place_from_text_params(search_query)

results = run_requests(map_df(get_premise, liquor_without_geometry))

Now, let's overwrite our previous google_response column with our new responses, and then use our extract_geometry function to extract the location from the responses.

In [90]:
liquor_without_geometry['google_response'] = results
liquor_without_geometry['geometry'] = liquor_without_geometry['google_response'].apply(extract_geometry_from_places_request)

Let's see how many more addresses we got from this approach.

In [91]:
print(len(liquor_without_geometry[~liquor_without_geometry['geometry'].isnull()]))
10

Any remaining rows will be dropped; if a business cannot be found on Google, it has likely closed and should contribute little to our analysis.

In [92]:
liquor_without_geometry = liquor_without_geometry.drop(liquor_without_geometry[liquor_without_geometry['geometry'].isnull()].index)

With that done, let's update our original dataframe with our new data.

In [93]:
liquor_df.update(liquor_without_geometry)
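
This works because DataFrame.update aligns rows by index label (liquor_without_geometry kept its original indices from liquor_df) and only overwrites with non-null values from the other frame. A toy illustration:

```python
import pandas as pd

df = pd.DataFrame({'geometry': [None, None, 'POINT A']}, index=[0, 1, 2])
patch = pd.DataFrame({'geometry': ['POINT B']}, index=[1])

# Only row 1 is present in the patch, so only row 1 is overwritten.
df.update(patch)
print(df['geometry'].tolist())  # row 1 patched; rows 0 and 2 untouched
```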

Now, any remaining null rows will be dropped, as they likely do not exist anymore.

In [94]:
print(liquor_df[liquor_df['geometry'].isnull()])
liquor_df = liquor_df.drop(liquor_df[liquor_df['geometry'].isnull()].index)
    geometry              TradeName EstablishmentDesc              AddrStreet   
108     None  BACCHUS BAR & LIQUORS            Tavern  1220 NORTH AVENUE WEST  \

    AddrZip                               google_response is_bar   
108   21217  {'candidates': [], 'status': 'ZERO_RESULTS'}   True  \

    is_liquor_store  
108            True  

Finally, let's commit our data.

In [95]:
generic_export('liquor-licenses', liquor_df, version=2)

Final touches¶

Next, let's add an is_strip_club column, derived directly from the EstablishmentDesc column.

In [96]:
liquor_df['is_strip_club'] = liquor_df['EstablishmentDesc'] == 'Adult'
print(liquor_df[liquor_df['is_strip_club'] == True].head())
                     geometry                    TradeName EstablishmentDesc   
0  POINT (-76.61007 39.28948)  DIVING HORSE GENTLEMAN CLUB             Adult  \
1  POINT (-76.60971 39.28954)                     RED ROOM             Adult   
2  POINT (-76.62099 39.28765)                      GODDESS             Adult   
3  POINT (-76.53867 39.30348)     WAGON WHEEL/PLAYERS CLUB             Adult   
4  POINT (-76.60913 39.28967)                   CIRCUS BAR             Adult   

                  AddrStreet AddrZip   
0          5 COMMERCE STREET   21202  \
1  411 BALTIMORE STREET EAST   21202   
2      36 EUTAW STREET SOUTH   21201   
3       821 NORTH POINT ROAD   21237   
4  427 BALTIMORE STREET EAST   21202   

                                     google_response is_bar is_liquor_store   
0  {'candidates': [{'geometry': {'location': {'la...  False           False  \
1  {'candidates': [{'geometry': {'location': {'la...   True           False   
2  {'candidates': [{'geometry': {'location': {'la...  False           False   
3  {'candidates': [{'geometry': {'location': {'la...  False           False   
4  {'candidates': [{'geometry': {'location': {'la...   True           False   

   is_strip_club  
0           True  
1           True  
2           True  
3           True  
4           True  

Now, we can remove EstablishmentDesc, since it's been superseded by the three is_<type> columns.

In [97]:
liquor_df = liquor_df.drop('EstablishmentDesc', axis=1)

Finally, let's commit the last version of the data.

In [98]:
generic_export('liquor-licenses', liquor_df, version=3)

Finding Vacant Buildings¶

To find the vacant buildings, we will use Open Baltimore's Vacant Building Notices dataset.

In [99]:
download_url = r'https://opendata.arcgis.com/api/v3/datasets/70de1e9137ce455aaee58f38b281d2cc_1/downloads/data?format=geojson&spatialRefId=4326&where=1%3D1'
download_cached(download_url, data_filepath('vacant-buildings', version=0))

vacant_buildings_df = generic_load('vacant-buildings', version=0)
In [100]:
print(vacant_buildings_df)
       OBJECTID NoticeNum                DateNotice  DateCancel  DateAbate   
0        723192   805231A 2012-01-25 15:36:59+00:00         NaN        NaN  \
1        723193  1780434A 2019-04-20 08:55:00+00:00         NaN        NaN   
2        723196   927919A 2013-02-01 14:19:00+00:00         NaN        NaN   
3        723197  2106499A 2022-05-13 16:20:00+00:00         NaN        NaN   
4        723202  2078873A 2022-02-10 16:01:00+00:00         NaN        NaN   
...         ...       ...                       ...         ...        ...   
14203    960932  2018569A 2021-07-15 13:28:59+00:00         NaN        NaN   
14204    960939  1581740A 2017-09-03 14:36:59+00:00         NaN        NaN   
14205    960966  2177271A 2022-11-03 16:20:00+00:00         NaN        NaN   
14206    960978  1695135A 2018-07-14 16:20:00+00:00         NaN        NaN   
14207    962285  1826489A 2019-08-10 12:40:59+00:00         NaN        NaN   

           NT OWNER_ABBR HousingMarketTypology2017 Council_District   
0      Vacant        NaN                         I                7  \
1      Vacant        NaN                         I                7   
2      Vacant        NaN                         I                7   
3      Vacant        NaN                         I                7   
4      Vacant        NaN                         I                7   
...       ...        ...                       ...              ...   
14203  Vacant        NaN                         A                1   
14204  Vacant        NaN                         J               12   
14205  Vacant        NaN                         B               11   
14206  Vacant        NaN                         I               12   
14207  Vacant        NaN                         H               13   

                                    Neighborhood  BLOCKLOT   
0      EASTERWOOD                                 0001 003  \
1      EASTERWOOD                                 0001 004   
2      EASTERWOOD                                 0001 007   
3      EASTERWOOD                                 0001 008   
4      EASTERWOOD                                 0001 013   
...                                          ...       ...   
14203                                FELLS POINT  1847 015   
14204                      EAST BALTIMORE MIDWAY  4004 015   
14205                                   DOWNTOWN  1351 016   
14206                      EAST BALTIMORE MIDWAY  4063 016   
14207                                MIDDLE EAST  1606 084   

                   Address                    geometry  
0         2041 W NORTH AVE  POINT (-76.65104 39.30943)  
1         2039 W NORTH AVE  POINT (-76.65099 39.30943)  
2         2033 W NORTH AVE  POINT (-76.65084 39.30943)  
3         2031 W NORTH AVE  POINT (-76.65080 39.30943)  
4         2021 W NORTH AVE  POINT (-76.65055 39.30944)  
...                    ...                         ...  
14203    1935 ALICEANNA ST  POINT (-76.58897 39.28342)  
14204     1034 E NORTH AVE  POINT (-76.60431 39.31196)  
14205          23 S GAY ST  POINT (-76.60845 39.28904)  
14206  2403 GREENMOUNT AVE  POINT (-76.60914 39.31685)  
14207     825 N MADEIRA ST  POINT (-76.58581 39.30052)  

[14208 rows x 13 columns]

We are not concerned with buildings that were only declared vacant after the end date of our crime data, so let's filter out those rows. First, let's convert the timezone-aware DateNotice column to a naive UTC datetime, dropping the timezone information. This allows us to compare it against the other naive datetimes we have.
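As a quick standalone illustration (toy values, not part of the pipeline): a timezone-aware timestamp cannot be compared to a naive one, but tz_convert(None) converts to UTC and strips the timezone, making the comparison legal.

```python
import pandas as pd

# Toy tz-aware series; the timestamps here are made up for illustration.
aware = pd.Series(pd.to_datetime(['2012-01-25 15:36:59+00:00',
                                  '2022-05-13 16:20:00+00:00']))
cutoff = pd.Timestamp('2015-01-01')  # naive

# `aware < cutoff` would raise a TypeError (tz-aware vs tz-naive).
# tz_convert(None) converts to UTC and drops the tz info.
naive = aware.dt.tz_convert(None)
print((naive < cutoff).tolist())  # [True, False]
```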

In [101]:
vacant_buildings_df['DateNotice'] = vacant_buildings_df['DateNotice'].dt.tz_convert(None)

Now, let's drop the rows we don't want.

In [102]:
crime_df = generic_load('crime-data')
last_crime_date = crime_df['CrimeDateTime'].max()

vacant_buildings_df = vacant_buildings_df.drop(vacant_buildings_df[vacant_buildings_df['DateNotice'] > last_crime_date].index)
vacant_buildings_df = vacant_buildings_df.reset_index(drop=True)

print(vacant_buildings_df['DateNotice'])
0      2012-01-25 15:36:59
1      2013-02-01 14:19:00
2      2016-01-28 14:47:00
3      2010-12-14 16:57:00
4      2008-02-08 12:29:00
               ...        
5579   2015-02-26 15:37:59
5580   2011-10-14 16:04:00
5581   2015-10-02 15:07:00
5582   2013-06-28 10:18:00
5583   2009-03-10 16:41:00
Name: DateNotice, Length: 5584, dtype: datetime64[ns]

That got rid of about 60% of the entries (14,208 down to 5,584)! Let's export our processed version of the data.

In [103]:
generic_export('vacant-buildings', vacant_buildings_df, version=1)

Exploratory Analysis & Data Visualization¶

This is where we will see what our data is telling us, so that we can make better judgments about what to focus on during interpretation. Let's take a subset of our crime data, since the full dataset is difficult to work with on consumer hardware. We will take data from 2015, the last full year in the data, and then draw a stratified random sample, taking 20% of the data from each month. We will also fix the random seed for deterministic results.
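The stratified-sampling pattern used below can be sketched on a toy frame (the months and fraction here are made up for illustration):

```python
import pandas as pd

# Toy data: 5 rows for each of two months.
toy = pd.DataFrame({'month': [1] * 5 + [2] * 5, 'value': range(10)})

# Sample the same fraction from every month, with a fixed seed for reproducibility.
sample = toy.groupby('month').apply(
    lambda m: m.sample(frac=0.4, random_state=12345)).reset_index(drop=True)

print(len(sample))  # 4 -- two rows drawn from each month
```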

In [116]:
import numpy as np

random_seed = 12345

crime_df = generic_load('crime-data')
crime_codes_df = generic_load('crime-codes')

crime_df = crime_df.merge(crime_codes_df[['CODE', 'VIO_PROP_CFS']], how='left', left_on='CrimeCode', right_on='CODE').drop(columns='CODE')
crime_df['crime_severity'] = np.where(crime_df['VIO_PROP_CFS'] == 'VIOLENT', 5, 1)
crime_df = crime_df.drop('VIO_PROP_CFS', axis=1)

crime_df['CrimeMonth'] = crime_df['CrimeDateTime'].dt.month
subset_crime_df = crime_df[crime_df['CrimeDateTime'].dt.year == 2015].groupby(
    'CrimeMonth').apply(lambda m: m.sample(frac=0.20, random_state=random_seed)
    ).reset_index(drop=True)

Regions with the Most Crime¶

Let's group the crimes into CSAs using geopandas.sjoin. The function adds an index_right column containing the index of the matching row from the right DataFrame, but we will drop it since we don't need it.

In [117]:
baltimore_csa_df = generic_load('baltimore-csa')
subset_crime_df = subset_crime_df.set_crs(baltimore_csa_df.crs)
subset_crime_df = gpd.sjoin(subset_crime_df, baltimore_csa_df[['geometry', 'FID', 'Community']]).drop(columns=['index_right'])

Now, let's draw a choropleth map showing which regions have the most crime.

In [118]:
crimes_per_csa = subset_crime_df.groupby('FID').size().reset_index(drop=True)
baltimore_csa_df['num_crimes'] = crimes_per_csa

There's a lot of boilerplate involved in making choropleth maps with Plotly, so let's abstract it into a generic function we can reuse for many different choropleths.

In [119]:
import plotly.express as px
import geopandas as gpd
from json import loads

# Convert the GeoDataFrame to a GeoJSON dict that Plotly can consume
geojson = loads(baltimore_csa_df.to_json())

def choropleth_baltimore(df, color_col, main_label, **kwargs):
    plotly_kwargs = {
        'data_frame': df,
        'geojson': geojson,
        'locations': df.index,
        'color': color_col,
        'color_continuous_scale': "Reds",
        'mapbox_style': "carto-positron",
        'zoom': 10,
        'center': {"lat": 39.29, "lon": -76.61},
        'range_color': [df[color_col].min(), df[color_col].max()],
        'opacity': 0.5,
        'labels': {color_col: main_label},
        'hover_name': 'Community'
    }

    plotly_kwargs.update(kwargs)
    return px.choropleth_mapbox(**plotly_kwargs)

Let's also make a function for drawing a scatterplot trace over a figure.

In [135]:
import plotly.graph_objects as go

def scatter_trace_baltimore(fig, df, hover_col=None, marker_size=5, **kwargs):
    # The parentheses matter here: without them, the conditional would gate the
    # entire dict union, leaving plotly_kwargs empty whenever hover_col is None.
    plotly_kwargs = {
        'lon': df.geometry.x,
        'lat': df.geometry.y,
        'mode': 'markers',
        'marker': {'size': marker_size}
    } | ({'hovertext': df[hover_col]} if hover_col else {})

    plotly_kwargs.update(kwargs)

    fig.add_trace(go.Scattermapbox(**plotly_kwargs))
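One subtlety with dict unions combined with conditionals, as in the function above: a conditional expression binds more loosely than |, so `a | b if cond else {}` means `(a | b) if cond else {}`, and the base dict is silently lost whenever the condition is false. A standalone illustration of the pitfall:

```python
base = {'mode': 'markers'}

# Without parentheses, the conditional gates the WHOLE union:
unparenthesized = base | {'hovertext': 'x'} if False else {}
print(unparenthesized)  # {} -- the base dict is lost

# Parenthesizing the conditional keeps the base dict either way:
parenthesized = base | ({'hovertext': 'x'} if False else {})
print(parenthesized)  # {'mode': 'markers'}
```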

Now, let's make a function for plotting crime.

In [122]:
def choropleth_crime(num_crimes_colname='num_crimes', **kwargs):
    helper_kwargs = {
        'color_col': num_crimes_colname,
        'main_label': 'Number of Crimes'
    }

    helper_kwargs.update(kwargs)
    return choropleth_baltimore(baltimore_csa_df, **helper_kwargs)

fig = choropleth_crime()

# Show the graph
fig.show()

These numbers are biased towards larger areas, so let's normalize them by each region's (min-max normalized) area.
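normalize_min_max was defined earlier in the notebook; a minimal equivalent (my sketch, the actual helper may differ slightly) is:

```python
import pandas as pd

def normalize_min_max(series: pd.Series) -> pd.Series:
    """Rescale a series linearly onto the [0, 1] range."""
    return (series - series.min()) / (series.max() - series.min())

print(normalize_min_max(pd.Series([2.0, 4.0, 6.0])).tolist())  # [0.0, 0.5, 1.0]
```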

In [124]:
normalized_area = normalize_min_max(baltimore_csa_df.geometry.area)
baltimore_csa_df['num_crimes_normalized'] = normalize_min_max(
    baltimore_csa_df['num_crimes'] / (normalized_area + 1))
/tmp/ipykernel_2124824/2427232003.py:1: UserWarning:

Geometry is in a geographic CRS. Results from 'area' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.


In [125]:
choropleth_crime(num_crimes_colname='num_crimes_normalized').show()

Surprisingly, the graph looks almost exactly the same after normalization. (Note the warning above: computing areas in a geographic CRS is only approximate, and re-projecting to a projected CRS first would be more precise, but for a relative comparison it is acceptable.)

Regions with the Highest Population Density¶

In [127]:
population_df = generic_load('population-density')
population_df['Community'] = baltimore_csa_df['Community']
choropleth_baltimore(population_df, 'density', 'Population Density (normalized)').show()

At first glance, there doesn't seem to be a strong correlation between population density and the normalized number of crimes.

Regions with the Lowest Median Income¶

In [129]:
median_income_df = generic_load('median-income')
median_income_df['Community'] = baltimore_csa_df['Community']
median_income_df['MedianIncomeNeg'] = 1 - median_income_df['MedianIncome']
choropleth_baltimore(median_income_df, 'MedianIncomeNeg', 'Inverse Median Income (normalized)').show()

At first glance, there also isn't a huge correlation between median income and crime rate.

Annotating DataFrame with Liquor Licenses Data¶

In order to analyze liquor stores, bars and strip clubs, let's add this data to our DataFrame.

First, we'll assign each establishment to a CSA. Categorizing dataframes this way will be a common operation, so we'll abstract it into a function.

In [131]:
liquor_df = generic_load('liquor-licenses')

def categorize_points_csa(df):
    return gpd.sjoin(df, baltimore_csa_df[['geometry', 'FID', 'Community']]).drop(columns=['index_right'])

liquor_df = categorize_points_csa(liquor_df.set_crs(baltimore_csa_df.crs))

Next, we'll count the total number of each type for each CSA. We'll make a generic function for this, since it will be a common pattern. Note that this function has side effects: it updates baltimore_csa_df with the new columns.

In [132]:
def add_counts_per_type(df, keys, out_keys) -> None:
    global baltimore_csa_df

    # Count rows per CSA; if boolean key columns are given, count each type separately.
    counts_per_type = df.groupby('FID').apply(
        lambda group: pd.Series([len(group[group[key]]) for key in keys])
        if keys else pd.Series([len(group)]))
    counts_per_type.columns = out_keys

    baltimore_csa_df = baltimore_csa_df.merge(counts_per_type, left_on='FID', right_index=True, how='left')
    baltimore_csa_df = baltimore_csa_df.fillna({key: 0 for key in out_keys})
In [133]:
add_counts_per_type(liquor_df, ['is_liquor_store', 'is_bar', 'is_strip_club'], ['num_liquor_stores', 'num_bars', 'num_strip_clubs'])

Regions with the Most Liquor Stores¶

In [136]:
fig = choropleth_baltimore(baltimore_csa_df, 'num_liquor_stores', 'Number of liquor stores')
scatter_trace_baltimore(fig, liquor_df[liquor_df['is_liquor_store']], 'TradeName')

fig.show()

We can see that Southwest Baltimore, the region with the highest crime in the city, has the most liquor stores out of all the regions. This suggests a correlation between liquor stores and crime, but we will test that formally later on.

Regions with the Most Strip Clubs¶

In [137]:
fig = choropleth_baltimore(baltimore_csa_df, 'num_strip_clubs', 'Number of strip clubs')
scatter_trace_baltimore(fig, liquor_df[liquor_df['is_strip_club']], 'TradeName')

fig.show()

I am baffled by the 14 distinct strip clubs all right next to each other in Seton Hill. I looked up a couple of them and they all seemed to be different places too…

Other than that, strip clubs do seem to correlate strongly with crime.

Regions with the Most Bars¶

In [138]:
fig = choropleth_baltimore(baltimore_csa_df, 'num_bars', 'Number of bars')
scatter_trace_baltimore(fig, liquor_df[liquor_df['is_bar']], 'TradeName')

fig.show()

It seems that the area with the most bars is Fells Point, which was not one of the regions with the highest crime. This suggests that bars, in comparison to liquor stores and strip clubs, do not have a significant effect on crime.

Regions with the Most Vacant Buildings¶

Let's classify the vacant buildings as we've done with the other data.

In [139]:
vacant_df = generic_load('vacant-buildings')
vacant_df = categorize_points_csa(vacant_df)

Next, we'll count the total number of vacant buildings for each CSA.

In [140]:
add_counts_per_type(vacant_df, [], ['num_vacant_buildings'])
In [141]:
fig = choropleth_baltimore(baltimore_csa_df, 'num_vacant_buildings', 'Number of vacant buildings')
scatter_trace_baltimore(fig, vacant_df, 'Address')

fig.show()

As we can see, Southwest Baltimore has by far the most vacant buildings, followed by Harlem Park.

Police Stations¶

Let's count the number of police stations in each CSA.

In [142]:
stations_df = generic_load('police-stations')
stations_df = categorize_points_csa(stations_df)

Next, we'll count the total number of police stations for each CSA.

In [143]:
add_counts_per_type(stations_df, [], ['num_police_stations'])
In [145]:
fig = choropleth_baltimore(baltimore_csa_df, 'num_police_stations', 'Number of police stations')
scatter_trace_baltimore(fig, stations_df, 'address')

fig.show()

Notably, Southwest Baltimore does have a police station despite having the highest crime rate, so the presence of a station alone clearly does not deter crime in its CSA.

Correlating Our Factors¶

Let's see if there's a correlation between our collected factors.

In [146]:
import statsmodels.formula.api as smf
baltimore_csa_df['MedianIncome'] = median_income_df['MedianIncome']
baltimore_csa_df['Density'] = population_df['density']
model = smf.ols(formula='num_crimes_normalized ~ MedianIncome + Density + num_bars + num_strip_clubs + num_liquor_stores + num_vacant_buildings + num_police_stations', data=baltimore_csa_df).fit()
print(model.summary())
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     num_crimes_normalized   R-squared:                       0.593
Model:                               OLS   Adj. R-squared:                  0.533
Method:                    Least Squares   F-statistic:                     9.793
Date:                   Fri, 12 May 2023   Prob (F-statistic):           1.85e-07
Time:                           13:55:55   Log-Likelihood:                 30.523
No. Observations:                     55   AIC:                            -45.05
Df Residuals:                         47   BIC:                            -28.99
Df Model:                              7                                         
Covariance Type:               nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept                0.2135      0.057      3.764      0.000       0.099       0.328
MedianIncome            -0.0280      0.093     -0.302      0.764      -0.214       0.158
Density                 -0.0402      0.118     -0.341      0.734      -0.277       0.197
num_bars                -0.0011      0.004     -0.303      0.763      -0.008       0.006
num_strip_clubs          0.0204      0.008      2.491      0.016       0.004       0.037
num_liquor_stores        0.0346      0.010      3.439      0.001       0.014       0.055
num_vacant_buildings     0.0002      0.000      1.422      0.162    -9.2e-05       0.001
num_police_stations      0.0111      0.064      0.174      0.863      -0.118       0.140
==============================================================================
Omnibus:                       10.485   Durbin-Watson:                   1.880
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               10.220
Skew:                           0.924   Prob(JB):                      0.00604
Kurtosis:                       4.022   Cond. No.                     1.42e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.42e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Let's drop the factors whose coefficients are clearly non-significant: median income (p ≈ 0.76), population density (p ≈ 0.73) and police stations (p ≈ 0.86).

In [147]:
model = smf.ols(formula='num_crimes_normalized ~ num_bars + num_strip_clubs + num_liquor_stores + num_vacant_buildings', data=baltimore_csa_df).fit()
print(model.summary())
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     num_crimes_normalized   R-squared:                       0.568
Model:                               OLS   Adj. R-squared:                  0.534
Method:                    Least Squares   F-statistic:                     16.75
Date:                   Fri, 12 May 2023   Prob (F-statistic):           7.95e-09
Time:                           13:55:58   Log-Likelihood:                 29.849
No. Observations:                     56   AIC:                            -49.70
Df Residuals:                         51   BIC:                            -39.57
Df Model:                              4                                         
Covariance Type:               nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept                0.1943      0.035      5.602      0.000       0.125       0.264
num_bars                -0.0017      0.003     -0.507      0.615      -0.008       0.005
num_strip_clubs          0.0208      0.008      2.598      0.012       0.005       0.037
num_liquor_stores        0.0351      0.010      3.538      0.001       0.015       0.055
num_vacant_buildings     0.0002      0.000      1.337      0.187   -8.89e-05       0.000
==============================================================================
Omnibus:                        8.622   Durbin-Watson:                   1.985
Prob(Omnibus):                  0.013   Jarque-Bera (JB):                7.894
Skew:                           0.840   Prob(JB):                       0.0193
Kurtosis:                       3.748   Cond. No.                         422.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In this reduced model, num_strip_clubs (p ≈ 0.01) and num_liquor_stores (p ≈ 0.001) are significant at the 95% confidence level, while num_bars shows no effect at all (p ≈ 0.62) and num_vacant_buildings falls short of significance (p ≈ 0.19). Let's drop num_bars as well.

In [148]:
model = smf.ols(formula='num_crimes_normalized ~ num_strip_clubs + num_liquor_stores + num_vacant_buildings', data=baltimore_csa_df).fit()
print(model.summary())
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     num_crimes_normalized   R-squared:                       0.566
Model:                               OLS   Adj. R-squared:                  0.541
Method:                    Least Squares   F-statistic:                     22.57
Date:                   Fri, 12 May 2023   Prob (F-statistic):           1.71e-09
Time:                           13:56:02   Log-Likelihood:                 29.708
No. Observations:                     56   AIC:                            -51.42
Df Residuals:                         52   BIC:                            -43.32
Df Model:                              3                                         
Covariance Type:               nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept                0.1958      0.034      5.710      0.000       0.127       0.265
num_strip_clubs          0.0203      0.008      2.570      0.013       0.004       0.036
num_liquor_stores        0.0319      0.008      4.147      0.000       0.016       0.047
num_vacant_buildings     0.0002      0.000      1.685      0.098   -3.89e-05       0.000
==============================================================================
Omnibus:                        8.415   Durbin-Watson:                   1.981
Prob(Omnibus):                  0.015   Jarque-Bera (JB):                7.655
Skew:                           0.831   Prob(JB):                       0.0218
Kurtosis:                       3.718   Cond. No.                         419.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Removing vacant buildings as well would make the R-squared drop even lower, so we will stick with this model. Lastly, let's graph the model's predictions and compare them to the real values.

In [149]:
predicted_crime_values = model.predict(baltimore_csa_df)
In [151]:
baltimore_csa_df['num_crimes_normalized_predicted'] = predicted_crime_values
fig = choropleth_crime(num_crimes_colname='num_crimes_normalized_predicted', main_label='Number of crimes predicted (normalized)')
fig.show()

As we can see, the basic model predicted the peaks correctly, and the rest of the regions were not too far off. To see this more clearly, let's plot the residuals.

In [152]:
baltimore_csa_df['num_crimes_residuals'] = baltimore_csa_df['num_crimes_normalized'] - predicted_crime_values
fig = choropleth_crime(num_crimes_colname='num_crimes_residuals', main_label='(Actual - Predicted) for normalized crime rate')
fig.show()

The linear regression model was not very accurate for pure prediction purposes, but it was enough to show a correlation between the variables.

Identifying Crime Hotspots¶

K-Means¶

To identify crime hotspots, we will use K-Means clustering. Let's iteratively search for a good value of K.

Since K-Means is distance-based, let's first normalize the relevant values.

In [153]:
from scipy.cluster.vq import whiten

subset_crime_df_latlon = pd.DataFrame({
    'lon': subset_crime_df.geometry.x,
    'lat': subset_crime_df.geometry.y
})

normalized_df = whiten(subset_crime_df_latlon)
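For reference, scipy's whiten simply divides each feature (column) by its standard deviation, so every feature ends up with unit variance; a standalone check on made-up data:

```python
import numpy as np
from scipy.cluster.vq import whiten

# Toy coordinates whose two columns have very different scales.
coords = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])

whitened = whiten(coords)

# Equivalent to dividing each column by its (population) standard deviation.
print(np.allclose(whitened, coords / coords.std(axis=0)))  # True
```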

Now, let's train the K-Means model, following Tamjid Ahsan's article Selecting optimal K for K-means clustering.

In [158]:
from sklearn.cluster import KMeans
import sklearn.metrics as metrics
import matplotlib.pyplot as plt

search_range = range(2, 21)
report = {}
trained_models = {}
for k in search_range:
    temp_dict = {}
    kmeans = KMeans(init='k-means++',
                    algorithm='lloyd',
                    n_clusters=k,
                    n_init='auto',
                    max_iter=1000,
                    random_state=1,
                    verbose=0).fit(normalized_df)

    inertia = kmeans.inertia_
    temp_dict['Sum of squared error'] = inertia
    try:
        cluster = kmeans.predict(normalized_df)
        chs = metrics.calinski_harabasz_score(normalized_df, cluster)
        ss = metrics.silhouette_score(normalized_df, cluster)
        temp_dict['Calinski Harabasz Score'] = chs
        temp_dict['Silhouette Score'] = ss
        report[k] = temp_dict
        trained_models[k] = cluster
    except Exception as e:
        print(e)
        report[k] = temp_dict
In [159]:
report_df = pd.DataFrame(report).T
report_df.plot(figsize=(15, 10),
               xticks=search_range,
               grid=True,
               title='Selecting optimal "K"',
               subplots=True,
               marker='o',
               sharex=True)

plt.tight_layout()

It looks like 6 is the ideal number of clusters: it has the highest Silhouette score, and the sum of squared error starts to flatten out around 6. Let's draw these 6 clusters on the map.
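The silhouette score measures how much closer points are to their own cluster than to the nearest other cluster (1 is ideal, values near 0 mean overlapping clusters). A toy sanity check with two well-separated blobs (made-up data, not the crime coordinates):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Two tight, well-separated blobs: the "right" k is clearly 2.
blobs = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
                   rng.normal(5.0, 0.1, size=(50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(blobs)

# Well-separated clusters score close to 1.
print(silhouette_score(blobs, labels) > 0.9)  # True
```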

In [161]:
k = 6
print(trained_models[k])

subset_crime_df['cluster'] = trained_models[k]

px.scatter_mapbox(
    subset_crime_df,
    lon = subset_crime_df.geometry.x,
    lat = subset_crime_df.geometry.y,
    color = 'cluster',
    # size = 7,
    opacity=0.6,
    mapbox_style='open-street-map',
    zoom=10
).show()
[0 0 0 ... 5 5 5]

The K-Means algorithm gave us 6 different clusters of crime, such that the distance within each cluster is minimized. Let's see how many points are in each cluster:

In [162]:
counts = np.bincount(trained_models[k])
print(counts)
[2820 1056  506 1417 1112 2744]

Based on these counts, the first and last clusters likely correspond to hotspots. On the map, they cover the central cross-section of Baltimore, which makes sense: it is the area closest to everything else in the city.

Interpretation/Conclusion¶

In this project, we analyzed Baltimore crime data and drew connections between different city-level factors and crime. Along with the crime data, we gathered the locations of all the police stations, the population density of the 56 CSAs, the median incomes, the liquor stores, bars and strip clubs, and all the vacant buildings. With more time, we could have explored even more factors, such as education level, proximity to parks, and gang activity.

These factors were all carefully chosen after preliminary research, with the assumption that they would naturally pan out. However, after reaching the end of the project, it has become clear that nothing in the real world is as straightforward as theory may lead you to believe. Crime is not something that can easily be modelled with such a small amount of data; it involves countless sociopolitical facts of life, which would take far longer to analyze.

The results of this analysis showed that proximity to liquor stores and strip clubs is very likely a factor that increases crime, while the remaining factors were dismissed. However, it's quite possible that some of the dismissed factors are genuinely relevant, and unfavorable sampling caused them to show up as irrelevant metrics.